feat: make mesh accept meshcontext by adil-a · Pull Request #2266 · NVIDIA-NeMo/Automodel

adil-a · 2026-05-18T15:19:53Z

What does this PR do?

Refactors the distributed public API so topology and distributed policies are layered explicitly.

The main user-facing object is now DistributedSetup, which owns:

mesh_context: runtime topology and DeviceMesh / MoE mesh access
strategy_config: FSDP2 / Megatron FSDP / DDP strategy config
pipeline_config: pipeline-parallel runtime config
moe_parallel_config: MoE parallelization config
activation_checkpointing: activation-checkpointing policy

MeshContext is narrowed to topology only. It no longer owns activation checkpointing or higher-level training policy.
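The ownership split described above can be pictured with a minimal sketch. The classes below are hypothetical stand-ins, not the real nemo_automodel definitions: only the five DistributedSetup field names and the device_mesh / moe_mesh / tp_size names come from this PR, and the field types are assumptions.

```python
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class MeshContextSketch:
    """Topology only: meshes and sizes, no training policy (hypothetical stand-in)."""

    device_mesh: Any = None
    moe_mesh: Any = None
    tp_size: int = 1


@dataclass(frozen=True)
class DistributedSetupSketch:
    """Bundle of topology plus distributed policies (hypothetical stand-in)."""

    mesh_context: MeshContextSketch
    strategy_config: Any = None
    pipeline_config: Any = None
    moe_parallel_config: Any = None
    activation_checkpointing: Any = False  # the real type may be a mode/enum


setup = DistributedSetupSketch(
    mesh_context=MeshContextSketch(tp_size=2),
    activation_checkpointing=True,
)
```

In this sketch, policy fields such as activation_checkpointing exist only on the setup object, so code that receives just the mesh context cannot read or change training policy.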

Changelog

Add DistributedSetup.build(...) as the component-layer entry point for constructing distributed setup from strategy, parallelism sizes, pipeline config, MoE config, and activation checkpointing.
Keep device_mesh compatibility in NeMoAutoModel*.from_pretrained by wrapping raw HF-style meshes into an internal topology-only DistributedSetup.
Remove legacy device_mesh.py and move raw mesh construction/access helpers into mesh_utils.py.
Introduce ParallelismSizes for dp/tp/pp/cp/ep sizing intent.
Move MoEParallelizerConfig into distributed config, since it is part of distributed setup rather than model-only MoE config.
Update recipes to build a single DistributedSetup from YAML/programmatic config and fan out the derived runtime attributes consistently.
Update diffusion, LLM, VLM, KD, retrieval, and sequence-classification callsites to use the new setup layering.
Update tests for the new layering and raw device_mesh compatibility.

API shape

Python usage:

from nemo_automodel.components.distributed import DistributedSetup, FSDP2Config, ParallelismSizes
from nemo_automodel import NeMoAutoModelForCausalLM

distributed_setup = DistributedSetup.build(
    strategy=FSDP2Config(sequence_parallel=True),
    parallelism_sizes=ParallelismSizes(tp_size=2, ep_size=8),
)

model = NeMoAutoModelForCausalLM.from_pretrained(
    "model/name",
    distributed_setup=distributed_setup,
)

HF-compatible raw mesh usage is still allowed:

model = NeMoAutoModelForCausalLM.from_pretrained(
    "model/name",
    device_mesh=device_mesh,
)
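One way the raw-mesh compatibility path could work is sketched below. Everything here is illustrative: SetupSketch and resolve_setup are hypothetical names, and rejecting the case where both arguments are passed is an assumption rather than confirmed behavior of from_pretrained.

```python
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class SetupSketch:
    """Hypothetical stand-in for a DistributedSetup-like bundle."""

    device_mesh: Any
    strategy_config: Any = None  # None marks a topology-only setup


def resolve_setup(
    distributed_setup: Optional[SetupSketch] = None,
    device_mesh: Any = None,
) -> Optional[SetupSketch]:
    """Normalize the two accepted inputs into a single setup object."""
    if distributed_setup is not None and device_mesh is not None:
        # Assumption: mixing both inputs is treated as ambiguous.
        raise ValueError("Pass either distributed_setup or device_mesh, not both.")
    if distributed_setup is not None:
        return distributed_setup
    if device_mesh is not None:
        # Wrap the raw mesh; no strategy or policy is attached.
        return SetupSketch(device_mesh=device_mesh)
    return None  # single-process case: nothing to set up


raw_mesh = object()  # placeholder for an HF-style DeviceMesh
wrapped = resolve_setup(device_mesh=raw_mesh)
```

Normalizing early means the rest of the loading path only ever has to handle one object shape.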

Future work

FSDP2Config is currently not FSDP-only: it also carries TP/SP options. A follow-up PR will split these out to separate the concerns.

Before your PR is "Ready for review"

Pre checks:

Make sure you read and followed Contributor guidelines
Did you write any new necessary tests?
Did you add or update any necessary documentation?

Validation:

python -m ruff check ...
python -m ruff format --check ...
python -m py_compile ...
pytest tests/unit_tests/recipes/test_dist_utils.py -q

Note: local full recipe test collection is blocked in my environment by an existing mlflow / cachetools.func.cached import mismatch. CI should be used for full CPU coverage.

Additional Information

This keeps the TorchTitan-like layering:

sizes: ParallelismSizes
topology: MeshContext
distributed policies and topology bundle: DistributedSetup
recipe/YAML adapter: create_distributed_setup_from_config
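The first layer (sizes) and the adapter step can be illustrated with a small sketch. It assumes the common convention that the data-parallel size is whatever remains of the world size after tp, pp, and cp are factored out; SizesSketch, sizes_from_config, and all field names other than tp_size are hypothetical, and ep is left out because its relation to the other axes is not described here.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SizesSketch:
    """Hypothetical stand-in for ParallelismSizes (field names are assumptions)."""

    dp_size: Optional[int] = None  # None: infer from the world size
    tp_size: int = 1
    pp_size: int = 1
    cp_size: int = 1

    def resolve_dp_size(self, world_size: int) -> int:
        """Infer dp as world_size / (tp * pp * cp), validating any explicit value."""
        model_parallel = self.tp_size * self.pp_size * self.cp_size
        if world_size % model_parallel != 0:
            raise ValueError(
                f"world_size={world_size} is not divisible by tp*pp*cp={model_parallel}"
            )
        inferred = world_size // model_parallel
        if self.dp_size is not None and self.dp_size != inferred:
            raise ValueError(f"dp_size={self.dp_size} conflicts with inferred {inferred}")
        return inferred


def sizes_from_config(cfg: dict) -> SizesSketch:
    """Hypothetical adapter step: pick size keys out of a parsed config dict."""
    keys = ("dp_size", "tp_size", "pp_size", "cp_size")
    return SizesSketch(**{k: cfg[k] for k in keys if k in cfg})


sizes = sizes_from_config({"strategy": "fsdp2", "tp_size": 2, "cp_size": 2})
dp_size = sizes.resolve_dp_size(world_size=16)
```

Keeping the sizing intent separate from the mesh means it can be validated before any process group or DeviceMesh is created.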

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

copy-pr-bot · 2026-05-18T15:19:57Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

adil-a · 2026-05-18T15:20:07Z

/ok to test 3dcadfb

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

akoumpa · 2026-05-18T17:39:12Z

/ok to test a8b2df6

akoumpa · 2026-06-08T20:41:45Z

/ok to test 4a4ba1a

…ntrol (#2444)

* feat(speculative): add reasoning mode control for EAGLE/P-EAGLE/DFlash training

  Add --reasoning {none,save,disable} flag to regenerate.py for controlling whether target model reasoning content is preserved or suppressed during data regeneration. Add mask_reasoning_content option to EAGLE/P-EAGLE/DFlash training recipes to exclude reasoning traces from the loss mask.

  Co-authored-by: khazic <khazzz1c@gmail.com>
  Signed-off-by: thyways <2484113689@qq.com>
  Signed-off-by: khazic <khazzz1c@gmail.com>

* feat(speculative): add EAGLE-3 sequence packing for draft training

  Pack variable-length chat samples into fixed-width rows for EAGLE-3 training, removing the per-sample padding waste of the default max_length path. Documents within a row attend block-causally: the target uses a 4D block-causal mask (SDPA) and the draft uses varlen FlashAttention-2; cross-document TTT supervision is gated by doc_remaining so deeper steps never leak across boundaries. Opt-in via packed_sequence_size > 0, colocated target backend only. Covered by unit tests plus an FA2-vs-eager parity test.

  Co-authored-by: khazic <khazzz1c@gmail.com>
  Signed-off-by: thyways <2484113689@qq.com>
  Signed-off-by: khazic <khazzz1c@gmail.com>

---------

Signed-off-by: thyways <2484113689@qq.com>
Signed-off-by: khazic <khazzz1c@gmail.com>
Co-authored-by: thyways <2484113689@qq.com>
Co-authored-by: Huiying <willwin.lee@gmail.com>

jgerh

Completed tech pubs review of docs/guides/gradient-checkpointing.md and provided a few suggestions

…2389)

* feat(distributed): add selective activation checkpointing for FSDP2
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* fix(distributed): support selective activation checkpointing with torch.compile
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(fern): drop selective AC from frozen v0.4 snapshot
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* feat(distributed): honor selective activation checkpointing on single GPU
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* feat(moe): support selective activation checkpointing with expert parallelism
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* fix(model): make DeepSeek MLP dispatch wrapper-safe
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* fix(distributed): save expert grouped-GEMM in selective AC and add op trace
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* feat(moe): compile selective activation checkpointing wrappers outer
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* refactor(distributed): move selective AC into its own module

  Extract the TorchTitan-style selective activation checkpointing core out of the central parallelizer.py into a dedicated activation_checkpointing.py: op-set construction, the save/recompute policy, block/sub-module wrappers, KV-sharing detection, and the compile-outer wrapper flag.

  parallelizer.py keeps only the thin apply_selective_activation_checkpointing entry point, which still needs the heavy, transformers-aware _extract_model_layers, so the dependency stays one-directional (parallelizer -> activation_checkpointing -> parallelizer_utils) with no circular imports.

  Move the opt-in NEMO_SELECTIVE_AC_TRACE diagnostic out of parallelizer.py into parallelizer_utils.maybe_trace_selective_ac_decision so the hot policy is a single call site instead of trace globals plus a helper.

  Make the new module's cross-module interface public (drop the leading underscore) and keep internal op-resolution/plumbing private. Update the moe and fsdp2 consumers and the unit tests to import from the new module.

  Also fix doc wording: clarify that torch.compile must be held fixed when comparing full vs. selective, and refer to TorchTitan as a reference implementation rather than "upstream".

  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* refactor(distributed): move selective-AC trace into the AC module
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* test(distributed): patch activation_checkpointing.checkpoint_wrapper after AC module split
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs: apply tech-writer edits to gradient-checkpointing guide
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

---------

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

* ci: add nemo-run, split qwen-vl-utils from decord for arm
  Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* fix: override in pytorch container
  Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
  Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

* Update uv lock
  Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>

---------

Signed-off-by: Dong Hyuk Chang <9426164+thomasdhc@users.noreply.github.com>
Signed-off-by: NeMo Bot <nemo-bot@nvidia.com>
Co-authored-by: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Co-authored-by: NeMo Bot <nemo-bot@nvidia.com>

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

…2419)

* fix(transformers): unify loaded HF dtype via promote_types

  Make _restore_loaded_model_dtype dtype-aware: instead of always restoring to the checkpoint dtype, unify each floating tensor to promote_types(checkpoint, requested). This honors an explicit fp32 request while preserving intrinsically-fp32 checkpoint params (e.g. A_log) under a bf16 request, and is a no-op for the bf16/auto path. Fixes FSDP2 uniform-dtype tripping on HF mixed-dtype loads.

  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* feat(distributed): default pipeline dtype to FSDP activation dtype

  When pipeline parallelism is enabled and pipeline.dtype is unset, derive it from the FSDP mixed-precision activation dtype (mp_policy.output_dtype, falling back to param_dtype) so pipeline stage shape inference matches the real activation dtype (e.g. bf16 compute under fp32 master weights). An explicitly set pipeline.dtype is honored but warned on mismatch, since it can corrupt inter-stage recv buffers. No-ops for strategies without an mp_policy (e.g. MegatronFSDP) and for pp_size==1.

  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
  (cherry picked from commit 3f6b246)

* refactor(distributed): resolve FSDP compute dtype per-param, decoupled from storage

  fully_shard_by_dtype now groups parameters by their required *compute* dtype instead of their storage dtype, so fp32 master weights (uniform fp32 storage) still compute the bulk in mp_policy.param_dtype (bf16) while intrinsically-fp32 params keep fp32 compute. Per-parameter compute dtype is resolved by precedence: pinned fp32 (_keep_in_fp32_modules_strict) > HF-recorded checkpoint dtype (tagged onto each tensor at load time in _restore_loaded_model_dtype) > mp_policy.param_dtype. Qwen3.5's GatedDeltaNet fp32 holder is declared via patch_hf_model; the NemotronH and Qwen3.5 strategies thread the declaration through.

  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
  (cherry picked from commit 3dd6b97)

* docs(model-onboarding): document _keep_in_fp32_modules_strict contract

  Add SKILL.md §2.6 explaining which params must compute in fp32 (SSM A_log/dt_bias/D, MoE sigmoid-gate bias, attention-sink bias, scale), how to declare them (class attribute vs patch_hf_model instance attribute), and why the pin is the robust signal across all load paths. Broaden the MoE checklist item and code comment accordingly.

  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
  (cherry picked from commit a11db38)

* test(distributed): add fp32 compute-dtype contract test

  Assert the resident compute dtype of every trainable parameter across the model archetypes that use fully_shard_by_dtype (dense, Qwen3.5-style hybrid), covering the full precedence chain: pinned fp32 > HF-recorded dtype > mp_policy.param_dtype, under fp32 master weights and ordinary loads.

  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
  (cherry picked from commit dc83926)

* feat(model): cast frozen modules to compute dtype to avoid mismatch
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
  (cherry picked from commit d321f5e)

* refactor(gemma4): drop projector dtype hook now general frozen cast handles it
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
  (cherry picked from commit 1bc67e2)

* feat(training): add dormant resolve_storage_dtype helper

  Add resolve_storage_dtype() (and its unit tests) for defaulting model.torch_dtype to fp32 for full-parameter torch.optim training. Not yet wired into recipes here; the call sites are marked with breadcrumb comments and enabled in a follow-up PR, keeping this PR limited to dtype bug fixes with no behavior/memory change.

  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* fix(model): cast frozen-module buffers and unsharded params to compute dtype
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(infra): correct frozen-tower FSDP comment to match sharding reality
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(mixed-precision): clarify TE vs torch AdamW memory and precision trade-offs
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(mixed-precision): apply tech writer edits
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

* docs(mixed-precision): drop unresolvable FSDP anchor
  Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>

---------

Signed-off-by: Yuhe Zhang <yuhez@nvidia.com>
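The compute-dtype precedence stated in the #2419 commit message above (pinned fp32 > HF-recorded checkpoint dtype > mp_policy.param_dtype) can be sketched in plain Python. This is an illustration only: dtypes are represented as strings, and resolve_compute_dtype, group_by_compute_dtype, and the parameter tuples are hypothetical, not the fully_shard_by_dtype implementation.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def resolve_compute_dtype(
    pinned_fp32: bool,
    recorded_dtype: Optional[str],
    policy_param_dtype: str,
) -> str:
    """Apply the precedence: pinned fp32 > recorded checkpoint dtype > policy dtype."""
    if pinned_fp32:
        return "float32"
    if recorded_dtype is not None:
        return recorded_dtype
    return policy_param_dtype


def group_by_compute_dtype(
    params: List[Tuple[str, bool, Optional[str]]],
    policy_param_dtype: str,
) -> Dict[str, List[str]]:
    """Group parameter names by resolved compute dtype, ignoring storage dtype."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for name, pinned, recorded in params:
        groups[resolve_compute_dtype(pinned, recorded, policy_param_dtype)].append(name)
    return dict(groups)


# (name, pinned_fp32, dtype recorded from the checkpoint); values are made up.
params = [
    ("mixer.A_log", True, "float32"),
    ("attn.q_proj.weight", False, "bfloat16"),
    ("new_head.weight", False, None),  # no recorded dtype: fall back to the policy
]
groups = group_by_compute_dtype(params, policy_param_dtype="bfloat16")
```

Because grouping keys on the resolved compute dtype, parameters stored uniformly in fp32 (master weights) can still land in the bf16 group.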

…2448) Add examples/speculative/README.md covering the whole speculative-decoding draft-training subsystem: supported methods (EAGLE-1/2/3/3.1, P-EAGLE, DFlash), target-model registry coverage, compute backends (eager vs flash_attention_2, flex_attention/sdpa, fused Triton soft cross-entropy, d2t/t2d draft-vocab compression), target backends (co-located, remote, offline cache), serving and benchmarking, inference-engine compatibility, and a consolidated config reference. Fold the standalone regenerate_with_target.md into the README's data preparation section (full two-step flow, tuning table, pitfalls) and remove the separate file so there is a single entry point. Signed-off-by: khazic <khazzz1c@gmail.com>

)

* feat(diffusion): add Wan2.2 T2V-A14B two-stage finetuning support
  Signed-off-by: linnan wang <linnanw@nvidia.com>

* fix the memory management for training large 14B wan model

* fix wan2.2 support

* all good for wan2.2

* update
  Signed-off-by: linnan wang <linnanw@nvidia.com>

* docs(fern): add Wan2.2 T2V-A14B model coverage and release log entry
  Signed-off-by: linnan wang <linnanw@nvidia.com>

* fix anther round of code review

* fix(diffusion): sort wan.py imports to satisfy CI isort (I001)
  Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

* fix(diffusion): load inference checkpoints to CPU to halve peak GPU memory

  Avoids doubling peak GPU memory (and a potential OOM in Wan2.2 two-stage inference) by loading EMA/consolidated state dicts with map_location="cpu"; load_state_dict copies into the already-on-device parameters.

  Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

---------

Signed-off-by: linnan wang <linnanw@nvidia.com>
Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

Resolve conflicts between the MeshContext/DistributedSetup refactor and main's selective activation checkpointing (#2389), FSDP2 dtype fixes (#2419), and DDP find_unused_parameters:

- config.py: keep the DistributedSetup/MoEParallelizerConfig refactor and the DistributedStrategyConfig rename; fold in ActivationCheckpointingMode + a back-compat DistributedConfig alias; widen DistributedSetup.activation_checkpointing; DDPConfig gains find_unused_parameters and drops backend.
- mesh.py: MeshContext stays pure topology (strategy/pipeline/moe/AC fields removed); main's AC-type change there is moot.
- infrastructure.py: keep moe_parallel_config param + cast_frozen_modules import; drop the relocated moe.config MoEParallelizerConfig import; widen activation_checkpointing.
- ddp.py / diffusion: preserve find_unused_parameters via DDPConfig, drop backend.
- multimodal/finetune.py: fix moe_config= -> moe_parallel_config= to match the API.
- tests: align dist_utils + diffusion DDP tests with the new DistributedSetup API.

Pull in Wan2.2 two-stage finetuning (#2284). The only conflict was the diffusion FSDP2 manager_args build: keep the PR's _build_diffusion_parallel_manager_args helper and teach it to honor fsdp.cpu_offload -> CPUOffloadPolicy so #2284's CPU-offload support is preserved through the refactored path.

akoumpa · 2026-06-09T02:29:39Z

/ok to test 7a55fc1

The DDP strategy config exposes find_unused_parameters (default False), so _build_diffusion_parallel_manager_args returns it in the ddp branch. Update the test's expected dict to match, fixing the L0 unit test failure. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

HuiyingLi · 2026-06-09T08:33:33Z

/claude review

HuiyingLi · 2026-06-09T08:33:34Z

/ok to test 42b703f

claude · 2026-06-09T08:39:08Z

+        if moe_parallel_config is None:
+            moe_parallel_config = MoEParallelizerConfig()
+        parallelize_fn = partial(
+            parallelize_model,
+            activation_checkpointing=activation_checkpointing,
+            **moe_parallel_config.to_dict(),


Bug: the old code forwarded model_wrapper.mp_policy (from FSDP2Config) to the MoE parallelizer when MoEParallelizerConfig.mp_policy was None:

# old code
moe_kwargs = moe_config.to_dict()
if moe_kwargs.get("mp_policy") is None and model_wrapper is not None:
    moe_kwargs["mp_policy"] = getattr(model_wrapper, "mp_policy", None)

This ensured that a custom mp_policy on FSDP2Config (e.g. fp16 or custom reduce_dtype) propagated to expert sharding. The new code doesn't forward it — MoEParallelizerConfig.mp_policy defaults to None, and the MoE parallelizer falls back to its own hardcoded bf16/fp32 default.

For the default config this is identical (both default to bf16/fp32), but for users passing a custom mp_policy on FSDP2Config with EP models, the MoE sharding will silently ignore their precision choice. Consider restoring the forwarding:

Suggested change

-        if moe_parallel_config is None:
-            moe_parallel_config = MoEParallelizerConfig()
-        parallelize_fn = partial(
-            parallelize_model,
-            activation_checkpointing=activation_checkpointing,
-            **moe_parallel_config.to_dict(),
+        if moe_parallel_config is None:
+            moe_parallel_config = MoEParallelizerConfig()
+        moe_kwargs = moe_parallel_config.to_dict()
+        if moe_kwargs.get("mp_policy") is None and model_wrapper is not None:
+            moe_kwargs["mp_policy"] = getattr(model_wrapper, "mp_policy", None)
+        parallelize_fn = partial(
+            parallelize_model,
+            activation_checkpointing=activation_checkpointing,
+            **moe_kwargs,
+        )
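The fallback behavior requested in this review comment can be checked in isolation with a small stand-alone sketch. MoEConfigSketch and build_moe_kwargs are hypothetical stand-ins (the real MoEParallelizerConfig has more fields), and the policy values are plain strings rather than real mixed-precision policy objects.

```python
from dataclasses import dataclass
from types import SimpleNamespace
from typing import Any, Dict, Optional


@dataclass
class MoEConfigSketch:
    """Hypothetical stand-in for MoEParallelizerConfig."""

    mp_policy: Optional[Any] = None
    reshard_after_forward: bool = False

    def to_dict(self) -> Dict[str, Any]:
        return dict(self.__dict__)  # shallow copy of the field values


def build_moe_kwargs(
    moe_parallel_config: Optional[MoEConfigSketch],
    model_wrapper: Any,
) -> Dict[str, Any]:
    """Forward the wrapper's mp_policy only when the MoE config leaves it unset."""
    if moe_parallel_config is None:
        moe_parallel_config = MoEConfigSketch()
    moe_kwargs = moe_parallel_config.to_dict()
    if moe_kwargs.get("mp_policy") is None and model_wrapper is not None:
        moe_kwargs["mp_policy"] = getattr(model_wrapper, "mp_policy", None)
    return moe_kwargs


wrapper = SimpleNamespace(mp_policy="fp16-policy")  # stands in for the FSDP2 wrapper
inherited = build_moe_kwargs(None, wrapper)
explicit = build_moe_kwargs(MoEConfigSketch(mp_policy="moe-policy"), wrapper)
```

An explicit MoE policy still wins; the wrapper's policy is used only as a fallback, which matches the old behavior the comment describes.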

claude · 2026-06-09T08:39:17Z

+components/distributed/mesh.py
+    MeshContext  -- strategy_config, device_mesh, moe_mesh, pipeline_config, moe_config
+                    Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
+    STRATEGY_MAP -- {"fsdp2": FSDP2Config, "megatron_fsdp": MegatronFSDPConfig, "ddp": DDPConfig}
+    MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD


Stale documentation: this block describes the pre-refactor MeshContext. After this PR:

MeshContext no longer has strategy_config, pipeline_config, or moe_config — those moved to DistributedSetup.

STRATEGY_MAP was removed from mesh.py — it's now _STRATEGY_MAP in config.py.

Suggested change

-components/distributed/mesh.py
-    MeshContext  -- strategy_config, device_mesh, moe_mesh, pipeline_config, moe_config
-                    Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
-    STRATEGY_MAP -- {"fsdp2": FSDP2Config, "megatron_fsdp": MegatronFSDPConfig, "ddp": DDPConfig}
-    MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD
+components/distributed/mesh.py
+    MeshContext  -- device_mesh, moe_mesh
+                    Properties: tp_size, pp_size, cp_size, ep_size, dp_size, dp_replicate_size
+    MeshAxisName -- PP, DP, DP_REPLICATE, DP_SHARD, DP_SHARD_CP, DP_CP, CP, TP, EP, EP_SHARD

claude · 2026-06-09T08:39:23Z

+```
+components/moe/config.py
+    MoEParallelizerConfig  -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc.


Stale path: MoEParallelizerConfig was moved to components/distributed/config.py in this PR.

Suggested change

-```
-components/moe/config.py
-    MoEParallelizerConfig  -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc.
+components/distributed/config.py
+    MoEParallelizerConfig  -- reshard_after_forward, ignore_router_for_ac, wrap_outer_model, etc.
+components/moe/config.py
+    MoEConfig -- n_routed_experts, n_activated_experts, score_func, etc.

HuiyingLi · 2026-06-09T08:56:55Z

/claude review

HuiyingLi · 2026-06-09T08:56:56Z

/ok to test 300109d

Resolve conflicts from main (14 commits) against the distributed-config refactor. Key resolutions:

- backend: keep the PR's removal of the configurable per-strategy `backend` (DDPConfig has no backend; managers don't take it; tests assert its absence). backend remains a process-group concern (dist_env / init).
- config.py / mesh.py / infrastructure.py: keep the PR's DistributedSetup/MeshContext structure and moe_parallel_config naming.
- activation checkpointing: keep the PR's design (carried on the parsed value and injected onto the strategy config later via infrastructure._with_activation_checkpointing, not in parse_distributed_section). Deduped a merge-duplicated _normalize_activation_checkpointing; updated the two selective-AC tests from main to assert the PR's behavior (AC stays off strategy_config).
- skills model-onboarding SKILL.md: take main's new "Declare model capabilities" section.
- moe/parallelizer.py: take main's _moe_shard_placement helper (it is used).
- _dist_utils.py: drop main's unused `import dataclasses`.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>
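The "inject later" design for activation checkpointing mentioned in this merge note can be pictured as follows. This is a guess at the general shape, not the body of infrastructure._with_activation_checkpointing: StrategySketch and with_activation_checkpointing are hypothetical, and returning a copy via dataclasses.replace (rather than mutating in place) is an assumption.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class StrategySketch:
    """Hypothetical strategy config; parsing leaves activation checkpointing off."""

    sequence_parallel: bool = False
    activation_checkpointing: Any = False


def with_activation_checkpointing(strategy: StrategySketch, mode: Any) -> StrategySketch:
    """Return a copy of the strategy with the AC setting applied at build time."""
    return replace(strategy, activation_checkpointing=mode)


parsed_strategy = StrategySketch(sequence_parallel=True)  # as produced by parsing
parsed_ac = "selective"  # carried separately on the parsed value (made-up value)
runtime_strategy = with_activation_checkpointing(parsed_strategy, parsed_ac)
```

Keeping the parsed strategy untouched is what lets tests assert that activation checkpointing stays off strategy_config until the infrastructure layer applies it.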

HuiyingLi · 2026-06-09T09:52:03Z

/ok to test 994ad67

- infrastructure.py: forward the model wrapper's mp_policy (from FSDP2Config) to the MoE expert parallelizer when MoEParallelizerConfig.mp_policy is unset, so a custom precision policy isn't silently dropped for EP models.
- skills/nemo-automodel-distributed-training/SKILL.md: fix stale references: MeshContext no longer holds strategy_config/pipeline_config/moe_config and STRATEGY_MAP moved to _STRATEGY_MAP in config.py; MoEParallelizerConfig now lives in components/distributed/config.py.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Signed-off-by: HuiyingLi <willwin.lee@gmail.com>

HuiyingLi · 2026-06-09T11:54:51Z

/claude review

HuiyingLi · 2026-06-09T11:54:52Z

/ok to test bbd2d61

claude

LGTM

Clean, well-structured refactoring that consolidates distributed setup into a single DistributedSetup object. The new layering (topology in MeshContext, policies in DistributedSetup) is clear and consistent across all recipe callsites. Test coverage is thorough — all major new code paths (DistributedSetup.build(), _resolve_distributed_setup(), _reject_separate_distributed_kwargs(), the backend removal, MegatronFSDP aliases) have dedicated tests. Skill documentation is updated to match the new API. No bugs, logic errors, or typos found.

make mesh accept meshcontext

3dcadfb

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

adil-a requested review from HuiyingLi, ZhiyuLi-Nvidia, akoumpa, athitten, hemildesai, pthombre and zyzhou5 as code owners May 18, 2026 15:19

fix(transformers): resolve mesh context inputs

a8b2df6

Signed-off-by: Alexandros Koumparoulis <akoumparouli@nvidia.com>

jgerh reviewed Jun 8, 2026

yuhezhang-ai and others added 9 commits June 8, 2026 15:10

feat(diffusion): improve qwen image finetuning configs (#2442)

e6302ad

Signed-off-by: Pranav Prashant Thombre <pthombre@nvidia.com> Co-authored-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

Apply suggestions from code review

6d52cf4

Co-authored-by: jgerh <163925524+jgerh@users.noreply.github.com> Signed-off-by: Alexandros Koumparoulis <153118171+akoumpa@users.noreply.github.com>

claude Bot reviewed Jun 9, 2026

claude Bot approved these changes Jun 9, 2026

thomasdhc approved these changes Jun 9, 2026

pthombre mentioned this pull request Jun 11, 2026

fix(diffusion): resolve flux nightly CI failures #2529

Merged

3 tasks

edjson mentioned this pull request Jun 11, 2026

refactor: Remove separate moe_mesh references #2123

Closed

3 tasks

9 participants

adil-a commented May 18, 2026 (edited by akoumpa)